Q6_K - Block Interleaving Implementation for x86 SIMD (AVX512/AVX2)#19706

Open
Manogna-Sree wants to merge 3 commits into ggml-org:master from Manogna-Sree:Q6_K_blockinterleaving_implementation
Conversation

@Manogna-Sree
Contributor

PR #15275, which we submitted previously, included repacking and block interleaving for the Q6_K nodes. However, on the latest master we observed inaccuracies related to the usage of the repacked scales present in the master branch.

This PR fixes those inaccuracy issues and adds a block interleaving approach for Q6_K quantization on the x64/x86 SIMD architecture.

- Initial gains were observed in prompt processing with the above changes compared to the baseline Q6_K model.
- The GEMM function is implemented for AVX512/AVX2, and the GEMV function is implemented for the AVX2 architecture.
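For context, the core idea behind block interleaving is to repack the quantized data of several consecutive rows so that one SIMD load fetches corresponding bytes from all rows at once. The sketch below is a simplified illustration of that repacking, not the actual Q6_K kernel from this PR: the block size, chunk width, and function name (`interleave_blocks`) are illustrative assumptions (real Q6_K blocks hold 256 quants plus scales).

```c
#include <stdint.h>

#define QK    16   /* simplified block size (real Q6_K uses 256 quants) */
#define NROWS 4    /* number of rows interleaved together */
#define CHUNK 8    /* bytes taken from each row per interleave step */

/* Repack the quant bytes of NROWS row-major blocks so that CHUNK-byte
 * runs from each row alternate in the output. A single wide SIMD load
 * on the repacked buffer then covers all NROWS rows simultaneously,
 * which is what enables the GEMM speedups for prompt processing. */
static void interleave_blocks(const uint8_t src[NROWS][QK],
                              uint8_t dst[NROWS * QK]) {
    int out = 0;
    for (int off = 0; off < QK; off += CHUNK) {
        for (int r = 0; r < NROWS; r++) {
            for (int b = 0; b < CHUNK; b++) {
                dst[out++] = src[r][off + b];
            }
        }
    }
}
```

With this layout, dst holds row 0's first 8 bytes, then row 1's first 8 bytes, and so on, before moving to each row's second 8-byte chunk.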

| Model | Size | Params | Backend | Threads | Test | t/s (mean ± std) | Speedup | Commit id |
|---|---|---|---|---|---|---|---|---|
| llama 7B Q6_K | 5.15 GiB | 6.74 B | CPU | 6 | pp 512 | 40.02 ± 0.06 | — | e9a859d - Base Commit |
| llama 7B Q6_K | 5.15 GiB | 6.74 B | CPU | 6 | pp 512 | 46.72 ± 0.08 | 16.79% | a3020c0 - AVX2 Version |
| llama 7B Q6_K | 5.15 GiB | 6.74 B | CPU | 6 | pp 512 | 61.61 ± 0.13 | 54% | a3020c0 - AVX512 Version |
| llama 7B Q6_K | 5.15 GiB | 6.74 B | CPU | 6 | tg 128 | 10.41 ± 0.00 | — | e9a859d - Base Commit |
| llama 7B Q6_K | 5.15 GiB | 6.74 B | CPU | 6 | tg 128 | 10.11 ± 0.00 | -2.88% | a3020c0 - AVX2 Version |
| llama 7B Q6_K | 5.15 GiB | 6.74 B | CPU | 6 | tg 128 | 10.104 ± 0.00 | -2.94% | a3020c0 - AVX512 Version |

GCC Version = 12.3

The PR was tested on an AMD Granite Ridge 9600X, which supports the following flags by default:

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

Additionally, the PR was tested for execution with clang on both Linux and Windows.

The perplexity results with llama2 7B are tabulated as follows:

| Model | Perplexity (Final estimate PPL) | Commit id |
|---|---|---|
| llama 7B Q6_K | 5.8164 ± 0.03250 | e9a859d - Base Commit |
| llama 7B Q6_K | 5.8163 ± 0.03250 | a3020c0 - Updated Commit |

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Feb 18, 2026